English–Welsh Cross-Lingual Embeddings
Authors
Abstract
Cross-lingual embeddings are vector space representations where word translations tend to be co-located. These embeddings enable learning transfer across languages, thus bridging the gap between data-rich languages, such as English, and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, in which a word vector space is first learned independently on a monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings into a common bilingual vector space. Two approaches were used to learn the monolingual embeddings: word2vec and fastText. Three cross-language retrieval strategies were explored: cosine similarity, inverted softmax, and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and sentiment analysis. The best results were achieved using fastText embeddings with the CSLS metric. We also demonstrated that, by adding a few automatically translated training documents, the performance of the text classifier can increase by 20 percentage points.
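To make the pipeline in the abstract concrete, the sketch below shows the two core steps in minimal NumPy: a supervised orthogonal (Procrustes) alignment learned from seed dictionary pairs, and CSLS-based translation retrieval. This is an illustrative reconstruction under assumptions, not the authors' code; all variable names (cy_dict_vecs, en_dict_vecs, k, etc.) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code) of supervised linear
# alignment plus CSLS retrieval. Requires pre-trained monolingual
# embeddings and a seed bilingual dictionary; all names are illustrative.
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||X @ W - Y||_F.

    X, Y: (n, d) arrays where row i of X and row i of Y embed the two
    sides of a known translation pair from the seed dictionary.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls(src: np.ndarray, tgt: np.ndarray, k: int = 10) -> np.ndarray:
    """Cross-domain similarity local scaling between two embedding sets.

    CSLS(x, y) = 2 * cos(x, y) - r(x) - r(y), where r(.) is the mean
    cosine similarity to the k nearest neighbours in the other space;
    this penalises "hub" words that are close to everything.
    """
    s = src / np.linalg.norm(src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = s @ t.T  # (n_src, n_tgt) cosine similarities
    # mean similarity of each source word to its k nearest target words
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # mean similarity of each target word to its k nearest source words
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt

# Usage sketch (dictionary induction): map Welsh vectors into the English
# space, then translate each Welsh word as the English word with the
# highest CSLS score.
# W = procrustes(cy_dict_vecs, en_dict_vecs)
# scores = csls(cy_all_vecs @ W, en_all_vecs)
# best_en_index = scores.argmax(axis=1)
```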
Similar resources
Cross-lingual Wikification Using Multilingual Embeddings
Cross-lingual Wikification is the task of grounding mentions written in non-English documents to entries in the English Wikipedia. This task involves the problem of comparing textual clues across languages, which requires developing a notion of similarity between text snippets across languages. In this paper, we address this problem by jointly training multilingual embeddings for words and Wiki...
Trans-gram, Fast Cross-lingual Word-embeddings
We introduce Trans-gram, a simple and computationally efficient method to simultaneously learn and align word embeddings for a variety of languages, using only monolingual data and a smaller set of sentence-aligned data. We use our new method to compute aligned word embeddings for twenty-one languages using English as a pivot language. We show that some linguistic features are aligned across lang...
A Variational Autoencoding Approach for Inducing Cross-lingual Word Embeddings
Cross-language learning allows one to use training data from one language to build models for another language. Many traditional approaches require word-level alignment of sentences from parallel corpora; in this paper, we define a general bilingual training objective function that requires only a sentence-level parallel corpus. We propose a variational autoencoding approach for training bilingual word e...
Learning Cross-lingual Word Embeddings via Matrix Co-factorization
A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this paper, we present a matrix co-factorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing monolingual matric...
Cross-Lingual Word Embeddings for Low-Resource Language Modeling
Most languages have no established writing system and minimal written records. However, textual data is essential for natural language processing, and particularly important for training language models to support speech recognition. Even in cases where text data is missing, there are some languages for which bilingual lexicons are available, since creating lexicons is a fundamental task of doc...
Journal
Journal title: Applied Sciences
Year: 2021
ISSN: 2076-3417
DOI: https://doi.org/10.3390/app11146541